AVX512+AVXVNNI GEMM implementation for quants using Q8_K for activations #710
Conversation
PP performance is about the same as using q8_k_r8 on the Ryzen-7950X, so we expect nice gains on Zen5, and we don't need to worry about using 2 different q8_k_r8 implementations for fancy SIMD.
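(For readers unfamiliar with the rN naming: below is a sketch of the row-interleaving idea behind q8_k_r8 / Q8_K_R16 style repacking. It illustrates the general concept only; the actual ik_llama.cpp block layout, with its scales and padding, is more involved.)

```cpp
#include <cstdint>
#include <vector>

// Interleave the quants of R consecutive rows block-by-block, so a GEMM
// kernel can fetch one block from each of R rows with a single contiguous
// read. R = 8 gives a q8_k_r8-like layout, R = 16 a Q8_K_R16-like one.
template <int R>
std::vector<int8_t> repack(const std::vector<std::vector<int8_t>> &rows,
                           int block_size) {
    const int n = static_cast<int>(rows[0].size());  // quants per row
    std::vector<int8_t> out;
    out.reserve(static_cast<size_t>(R) * n);
    for (int b = 0; b < n; b += block_size)   // for each block position...
        for (int r = 0; r < R; ++r)           // ...take that block from each row
            out.insert(out.end(),
                       rows[r].begin() + b,
                       rows[r].begin() + b + block_size);
    return out;
}
```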
I tried to test this PR. With GPU in the mix, I saw insignificant degradation of performance on PR710 compared to the main branch (PP 1091.02 vs 1094.84 t/s, TG 18.65 vs 18.79 t/s). When running without GPU (compiled CPU-only), I got an error.

Compiled using: …

Full log: …

Not sure what I am doing wrong.
Did one quick comparison between this PR and main on my AMD 9950X, compiled CPU-only, running this Qwen3-30B-A3B-Thinking-2507-IQ4_KSS, which isn't a perfect test as it has some …. Despite that, I'm still seeing a prompt processing uplift of around 15-33% at low kv-cache depth depending on how I'm running it. I'll try to get some more testing in soon, including on that remote AMD EPYC 9965 192-core rig, where it could definitely help given I run it CPU-only.

👈 Details

cmake -B build -DGGML_CUDA=0 -DGGML_VULKAN=0 -DGGML_BLAS=0
cmake --build build --config Release -j $(nproc)
./build/bin/llama-sweep-bench \
--model "$model" \
-fa -fmoe \
-ctk q8_0 -ctv q8_0 \
-c 10240 \
-rtr \
--threads 16 \
--warmup-batch

Tested configurations:
ik_llama.cpp ik/q8_k_r16@f3edfe0f -rtr -ctk q8_0 -ctv q8_0
ik_llama.cpp main@0cb66969 -rtr -ctk q8_0 -ctv q8_0
ik_llama.cpp ik/q8_k_r16@f3edfe0f --no-mmap -ub 4096 -b 4096 -ctk q8_0 -ctv q8_0
ik_llama.cpp main@0cb66969 --no-mmap -ub 4096 -b 4096 -ctk q8_0 -ctv q8_0
ik_llama.cpp ik/q8_k_r16@f3edfe0f --no-mmap -ub 4096 -b 4096
ik_llama.cpp main@0cb66969 --no-mmap -ub 4096 -b 4096
Hrmm, your …

Likely unrelated to the ASSERT you mention, a few things with your command: …

Everything else looks reasonable to me. Not sure what is up with that ASSERT when compiling CPU-only, which is a good way to test this PR.
@ubergarm Yeah, thank you for the notes. I have a universal semi-interactive script for running these tests more easily, so there are params that only affect llama-server or some other models. Anyway, this is what CPU-Z shows inside the VM: [CPU-Z screenshot]

I didn't see any significant gains last time when I tested #610 on Windows on bare metal either. Not sure why - it might just be Windows. With Proxmox in the way, things got even more complicated and obscured. I expected some drop in performance due to virtualization, and I did see that on most metrics (around 5-10%), but interestingly RAM read bandwidth doubled when measured by OCCT, now well over 1000 GB/s - probably an impossible value for the hardware and difficult to explain. And don't even get me started on CPU threads and core pinning - it's a mess!

Good to see improvements with this PR from other people. I'm sure I'll figure out the oddities of my setup eventually.
Thanks for testing. Short of downloading the exact same model and running the exact same command as you, I'm not able to reproduce the assert. But I have pushed a change that hopefully fixes it. Also, because there is some doubt about whether you are actually enabling the faster "fancy" SIMD path, I have added a hack that will tell us if it is enabled. If you pull the latest, rebuild, and run any command, you will then see in the output either
…
or
…

Oh, one more thing: …
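Presumably the diagnostic is a compile-time check along these lines (a sketch only; the macro name HAVE_FANCY_SIMD and the message text are assumptions, not the exact code added in the PR):

```cpp
#include <cstdio>

// Sketch of a one-time compile-time diagnostic. HAVE_FANCY_SIMD is an
// assumed macro standing for "AVX512VNNI/AVX512VL etc. were enabled at
// build time"; the real PR may use different names and wording.
static void report_simd_path() {
#ifdef HAVE_FANCY_SIMD
    fprintf(stderr, "fancy SIMD path: enabled\n");
#else
    fprintf(stderr, "fancy SIMD path: not enabled\n");
#endif
}
```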
Thanks for testing! It looks like in your case …

With -rtr: …

Without -rtr: …

In any case, it would be useful to have the benchmark results for the Qwen3-14B model that you used in #610, to know if this PR is as good as #610.
I've been thinking about improvements to sweep-bench that would sweep with varying or multiple batch sizes, and then do TG afterwards, testing at varying points with a configurable number of tokens at a time (independent of batch size). I don't like how little data you get about TG when testing with large batch sizes, or how little you can configure how quick or thorough the run should be. I haven't gotten around to doing it, and I'm not even sure how much people would want it.
I have been thinking about that too. On a number of occasions I have wished that I could define the step size of the sweep independently from the batch/u-batch being used and the number of TG tokens. But it never became painful enough for me to sit down and do it. In any case, for the tests being done for this PR running CPU only, my point was to not use such huge batch and u-batch sizes. It is also not necessary to go to very high context lengths.
Yep. Although I have occasionally hard-coded 32 instead of ubatch/4, the better solution would do TG differently: a PP pass to build up the KV cache, then TG breaks it down by removing tokens from the cache until you are wherever you want to be to collect data. A rough sketch of that scheme follows below.
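For what it's worth, the scheme being described might look roughly like this (all names here are hypothetical placeholders, not actual llama-sweep-bench code):

```cpp
// Hypothetical helpers standing in for the real llama.cpp calls.
struct Context;                                   // stand-in for the llama context
void prefill(Context &ctx, int n_tokens);         // one PP pass to fill the KV cache
void kv_cache_trim(Context &ctx, int depth);      // drop cache entries past `depth`
void measure_tg(Context &ctx, int depth, int n);  // time `n` generated tokens at `depth`

// One PP pass builds the KV cache once; TG is then measured at arbitrary
// depths by trimming the cache back, with a step size that is independent
// of the batch/u-batch sizes used for PP.
void sweep_tg(Context &ctx, int max_depth, int step, int tg_tokens) {
    prefill(ctx, max_depth);
    for (int depth = max_depth; depth >= 0; depth -= step) {
        kv_cache_trim(ctx, depth);
        measure_tg(ctx, depth, tg_tokens);
    }
}
```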
No, good point, I had not tried the default batches without rtr, and indeed, like your 7950X, without … [graph]

I'll try to keep it simple and just use default batches without -rtr.
So with this quant, PR710 > PR610 > main for PP speed. I do have that old PR610 branch (ik/q8_k_r8_avx512) around for comparison. [graph]

👈 Details

./build/bin/llama-sweep-bench \
--model "$model" \
-fa -fmoe \
-ctk q8_0 -ctv q8_0 \
-c 10240 \
--no-mmap \
--threads 16 \
--warmup-batch

Tested configurations:
ik_llama.cpp PR710 ik/q8_k_r16@f3edfe0f --no-mmap -ctk q8_0 -ctv q8_0
ik_llama.cpp main@0cb66969 --no-mmap -ctk q8_0 -ctv q8_0
ik_llama.cpp PR610 ik/q8_k_r8_avx512+main@0cb6696 --no-mmap -ctk q8_0 -ctv q8_0
Okay, here are results for the test quant I used on PR610: …. Should I try to make a …? [graph]

👈 Details

./build/bin/llama-sweep-bench \
--model "$model" \
-fa \
-c 8704 \
--warmup-batch \
--threads 16

Tested configurations:
ik_llama.cpp PR710 ik/q8_k_r16@f3edfe0f
ik_llama.cpp main@0cb66969
ik_llama.cpp PR610 ik/q8_k_r8_avx512+main@0cb6696
One more quick test using a "pure" IQ2_KT, which I'm pretty sure is not involved in these PRs, given the speed is the same across all three test cases. [graph]

👈 Details

./build/bin/llama-sweep-bench \
--model "$model" \
-fa \
-ctk q8_0 -ctv q8_0 \
-c 8704 \
--warmup-batch \
--threads 16

Tested configurations:
ik_llama.cpp PR710 ik/q8_k_r16@f3edfe0f
ik_llama.cpp main@0cb66969
ik_llama.cpp PR610 ik/q8_k_r8_avx512+main@0cb6696
Okay, sorry for spamming this thread lol, one more example: a "pure" IQ2_KL, which I'm pretty sure is used in this PR (there is a commit comment for it, but I believe it was omitted in the PR description). For this case again we see PR710 > PR610 > main for PP speed. [graph]

👈 Details

./build/bin/llama-sweep-bench \
--model "$model" \
-fa \
-ctk q8_0 -ctv q8_0 \
-c 8704 \
--warmup-batch \
--threads 16

Tested configurations:
ik_llama.cpp PR710 ik/q8_k_r16@f3edfe0f
ik_llama.cpp main@0cb66969
ik_llama.cpp PR610 ik/q8_k_r8_avx512+main@0cb6696
@ubergarm Thanks for the thorough testing! It seems to be even slightly better than #610!

I haven't even implemented the ability to quantize a model to Q8_K_R16. So, the only thing left to resolve before merging is the assert observed by @sousekd.
All sorted. I don't see the assert now, even with the …
This is an alternative to #610
The expectation is that it will significantly improve CPU-only prompt processing speed on "true" AVX512 CPUs that support the AVX512VNNI, AVX512VL, AVX512BW, and AVX512DQ extensions (e.g., Zen5). These extensions are also supported by Zen4 cores, but those are not "true" AVX512 CPUs, as their 512-bit instructions are executed as two 256-bit operations. The main benefit compared to #610 is that on such CPUs performance is about the same as on main, unlike #610, where we get a 10-15% performance penalty.

The PR adds a new quantization type, Q8_K_R16, which is only used for quantizing activation tensors. IQ1_S, IQ1_M, IQ2_XXS, IQ2_XS, IQ2_S, IQ3_XXS, IQ3_S, IQ4_XS, Q2_K, Q3_K, IQ2_KS, IQ2_K, IQ3_KS, IQ3_K, IQ4_KSS, IQ4_KS, IQ4_K, IQ5_KS, IQ5_K, IQ6_K all use this new quantization type for GEMM when the batch size is >= 32 and "fancy SIMD" is available (i.e., the above AVX512 extensions are supported by the CPU).
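To give a flavor of what the "fancy SIMD" path buys, here is a minimal sketch of a VNNI-based int8 dot product (an illustration of the instruction, not the actual PR kernel; real Q8 kernels also handle block scales and the unsigned-operand bias correction):

```cpp
#include <immintrin.h>
#include <cstdint>

// VPDPBUSD multiplies unsigned bytes from x with signed bytes from y and
// accumulates into 32-bit lanes, doing 64 byte-products per instruction.
// Q8-style kernels typically bias one operand to unsigned and correct with
// a precomputed row sum. Assumes n is a multiple of 64.
static int32_t dot_i8_vnni(const uint8_t *x, const int8_t *y, int n) {
    __m512i acc = _mm512_setzero_si512();
    for (int i = 0; i < n; i += 64) {
        const __m512i vx = _mm512_loadu_si512(x + i);
        const __m512i vy = _mm512_loadu_si512(y + i);
        acc = _mm512_dpbusd_epi32(acc, vx, vy);  // 4-byte dot per 32-bit lane
    }
    return _mm512_reduce_add_epi32(acc);         // horizontal sum of the 16 lanes
}
```

Compile with, e.g., g++ -O2 -mavx512f -mavx512vnni.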
If you have a CPU that supports the AVX512VNNI, AVX512VL, AVX512BW, and AVX512DQ extensions (Ryzen 99XX, EPYCs of the 9005 series (a.k.a. Turin), recent Intel Xeon CPUs), please test and let me know if this PR improves CPU performance.

@ubergarm Pinging you explicitly.
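If you are unsure whether your CPU qualifies, a quick standalone check using the GCC/Clang builtins (not part of the PR) might look like this:

```cpp
#include <cstdio>

// Reports whether all four extensions the PR's fast path requires are
// present on the running CPU (GCC/Clang __builtin_cpu_supports).
int main() {
    const bool fancy = __builtin_cpu_supports("avx512vnni")
                    && __builtin_cpu_supports("avx512vl")
                    && __builtin_cpu_supports("avx512bw")
                    && __builtin_cpu_supports("avx512dq");
    printf("fancy SIMD extensions: %s\n", fancy ? "available" : "not available");
    return 0;
}
```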